Data management

Author

Tony Duan

1. Introduction to Data Management

Effective data management is crucial for reproducible and collaborative data science. It involves organizing, storing, and sharing data in a way that is efficient, secure, and easy to manage. In R, the pins package provides a powerful and straightforward solution for this challenge.

The pins package allows you to “pin” data objects (such as data frames, models, or plots) to a “board.” A board is the location where your pins are stored: a local folder, a network drive, or a cloud service like Amazon S3, Google Cloud Storage, or Posit Connect. This makes it easy to share and access data across projects and colleagues, and even between R and Python when a pin is saved in a language-neutral format such as CSV or Parquet.

The key idea is to treat data like a package. You can publish data to a board, and then others (or your future self) can read that data with a single command, without worrying about file paths or where the data is stored.

2. Getting Started with `pins`

First, you need to install the package from CRAN.

Code

install.packages("pins")

Then, load the necessary libraries. We will use pins and tidyverse.

Code

library(pins)
library(tidyverse)

3. Creating a Board

A board is the storage location for your pins. For this example, we will create a simple board in a local folder. This is great for managing data for your own projects on a single machine.

Code

# Create a board in a subfolder named 'my_local_board'
# This folder will be created in your current working directory.
board <- board_folder("my_local_board")

You can check the path of your board.

Code

board

Pin board <pins_board_folder>
Path: 'my_local_board'
Cache size: 0

4. Pinning Data

“Pinning” data means saving an R object to your board with a specific name. Let’s pin the built-in mtcars dataset.

The pin_write() function is used for this. It takes four main arguments:

board: The board where you want to store the pin.
x: The R object you want to pin.
name: A unique name for the pin.
type: The file format to save the pin as (e.g., “rds”, “csv”, “parquet”). pins will choose a sensible default if you don’t specify.

Code

# Pin the mtcars data frame to our local board
# We can also add a description for context
pin_write(board, mtcars, name = "mtcars_data", description = "Motor Trend Car Road Tests dataset", type = "rds")

You can list the pins on your board to see what’s available.

Code

pin_list(board)

[1] "mtcars_data"

You can also search for pins with pin_search().

Code

pin_search(board, "mtcars")

# A tibble: 1 × 6
  name        type  title               created             file_size meta      
  <chr>       <chr> <chr>               <dttm>              <fs::byt> <list>    
1 mtcars_data rds   mtcars_data: a pin… 2025-07-01 13:38:30     1.19K <pins_met>

5. Reading Data from a Pin

Once data is pinned, you can easily read it back into your R session using pin_read(). This is incredibly useful for starting a new analysis without having to re-run a long data preparation script.

Code

# Read the 'mtcars_data' pin from our board
my_mtcars_data <- pin_read(board, "mtcars_data")

head(my_mtcars_data)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
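What pin_read() returns depends on the type the pin was written with. As a hedged illustration, the sketch below stores the same data as CSV, a format that tools outside R can also open, and reads it back. The board path and the pin name mtcars_csv are made up for this example.

```r
library(pins)

# Throwaway board in a temporary directory (assumption for this sketch)
demo_board <- board_folder(file.path(tempdir(), "csv_demo_board"))

# Store the data as CSV instead of RDS
pin_write(demo_board, mtcars, name = "mtcars_csv", type = "csv")

# pin_read() parses the CSV file back into a data frame
csv_data <- pin_read(demo_board, "mtcars_csv")

# The rows and the original columns survive the round trip;
# R-specific attributes such as row names may not
nrow(csv_data)
all(names(mtcars) %in% names(csv_data))
```

RDS preserves an R object exactly, so it is the safer choice when only R will read the pin; CSV or Parquet trade some fidelity for portability.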

6. Sharing Data with Others

The real power of pins comes from sharing data. While a local folder board is good for personal use, you can use other board types to collaborate with a team. Common choices include:

board_folder(): For a shared network drive.
board_s3(): For Amazon S3.
board_gcs(): For Google Cloud Storage.
board_connect(): For Posit Connect (formerly RStudio Connect), which is an excellent choice for teams within an organization.

The workflow remains the same regardless of the board type. For example, to pin to an S3 bucket, your code would look like this (after setting up authentication):

Code

# Connect to an S3 board (requires AWS credentials to be configured)
s3_board <- board_s3("my-team-s3-bucket")

# Write and read from the S3 board just like a local one
pin_write(s3_board, mtcars, name = "shared_mtcars")
shared_data <- pin_read(s3_board, "shared_mtcars")

This makes your data assets portable and accessible, whether you are working on your laptop, a cloud virtual machine, or a production server.
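The same pattern can be tried without cloud credentials by pointing two board objects at one folder, which is roughly how a shared network drive behaves. In this sketch the path under tempdir() stands in for a real shared location, and the names publisher_board and consumer_board are hypothetical.

```r
library(pins)

# Stand-in for a shared network location (hypothetical path)
shared_path <- file.path(tempdir(), "team_board")

# "Publisher" session: writes the pin
publisher_board <- board_folder(shared_path)
pin_write(publisher_board, mtcars, name = "shared_mtcars", type = "rds")

# "Consumer" session: a separate board object pointing at the same folder
consumer_board <- board_folder(shared_path)
shared_data <- pin_read(consumer_board, "shared_mtcars")

# The consumer gets the same data the publisher wrote
all.equal(shared_data, mtcars)
```

Only the board constructor changes between storage backends; the pin_write() and pin_read() calls stay the same.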

7. Versioning

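pins can version your data. On a versioned board, writing to a pin with the same name multiple times saves each write as a new version, so you can track changes and return to an earlier version when needed. Remote boards such as board_s3() and board_connect() are versioned by default, but board_folder() defaults to versioned = FALSE, in which case a new write replaces the previous one. To keep history on a folder board, create it with board_folder(path, versioned = TRUE) or pass versioned = TRUE to pin_write().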

Let’s modify our mtcars data and pin it again.

Code

mtcars_modified <- mtcars %>% mutate(hp_per_cyl = hp / cyl)
pin_write(board, mtcars_modified, name = "mtcars_data")

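Now, you can list the available versions for the “mtcars_data” pin. Because our folder board was created without versioned = TRUE, the second write replaced the first, so only one version is listed.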

Code

pin_versions(board, "mtcars_data")

# A tibble: 1 × 3
  version                created             hash 
  <chr>                  <dttm>              <chr>
1 20250701T053830Z-9365e 2025-07-01 13:38:30 9365e

You can read a specific version by passing its version identifier (the value in the version column, not only the hash) to the version argument of pin_read().

Code

# Take the version identifier from the first row of the listing
first_version_hash <- pin_versions(board, "mtcars_data")$version[1]

# Read that specific version by its identifier
original_data <- pin_read(board, "mtcars_data", version = first_version_hash)

# This board is unversioned, so the only stored version is the latest write
# and it includes the new hp_per_cyl column
head(original_data)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_per_cyl
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   18.33333
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   18.33333
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   23.25000
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1   18.33333
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   21.87500
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1   17.50000
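A folder board keeps earlier versions only when versioning is switched on. The sketch below repeats the two writes on a board created with versioned = TRUE and then reads the older version back. The board path under tempdir() and the names versioned_board and oldest_id are assumptions for illustration.

```r
library(pins)
library(dplyr)

# Folder board with versioning enabled (temporary location for this sketch)
versioned_board <- board_folder(
  file.path(tempdir(), "versioned_demo_board"),
  versioned = TRUE
)

# First version: the original data
pin_write(versioned_board, mtcars, name = "mtcars_data", type = "rds")

# Pause so the two versions get distinct timestamps
Sys.sleep(1)

# Second version: the data with an extra column
mtcars_modified <- mtcars %>% mutate(hp_per_cyl = hp / cyl)
pin_write(versioned_board, mtcars_modified, name = "mtcars_data", type = "rds")

# Both versions are now listed
versions <- pin_versions(versioned_board, "mtcars_data")
nrow(versions)

# Pick the oldest version by creation time and read it
oldest_id <- versions$version[which.min(versions$created)]
original_data <- pin_read(versioned_board, "mtcars_data", version = oldest_id)

# The oldest version lacks the new column; the latest version has it
"hp_per_cyl" %in% names(original_data)
"hp_per_cyl" %in% names(pin_read(versioned_board, "mtcars_data"))
```

Selecting the version by its created timestamp avoids relying on the row order of the listing.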

8. References

pins for R Official Website: https://pins.rstudio.com/
Getting Started with pins: https://pins.rstudio.com/articles/pins.html
Managing and Sharing Data with pins Blog Post: https://posit.co/blog/2023/02/13/announcing-pins-1-1-0/
